The Thera Bank - Credit Card Users Churn Prediction - Feature Selection, Model Selection and Tuning

Context:

The Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who are likely to leave their credit card services, and the reasons for leaving, so that the bank can improve in those areas.

Objective:

  1. Explore and visualize the dataset.
  2. Build a classification model to predict if the customer is going to churn or not.
  3. Optimize the model using appropriate techniques.
  4. Generate a set of insights and recommendations that will help the bank.

Data Dictionary:

  1. CLIENTNUM: Client number. Unique identifier for the customer holding the account
  2. Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
  3. Customer_Age: Age in Years
  4. Gender: Gender of the account holder
  5. Dependent_count: Number of dependents
  6. Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to a college student), Post-Graduate, Doctorate.
  7. Marital_Status: Marital Status of the account holder
  8. Income_Category: Annual Income Category of the account holder
  9. Card_Category: Type of Card
  10. Months_on_book: Period of relationship with the bank
  11. Total_Relationship_Count: Total no. of products held by the customer
  12. Months_Inactive_12_mon: No. of months inactive in the last 12 months
  13. Contacts_Count_12_mon: No. of Contacts between the customer and bank in the last 12 months
  14. Credit_Limit: Credit Limit on the Credit Card
  15. Total_Revolving_Bal: The balance that carries over from one month to the next is the revolving balance
  16. Avg_Open_To_Buy: Open to Buy refers to the amount left on the credit card to use (Average of last 12 months)
  17. Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  18. Total_Trans_Ct: Total Transaction Count (Last 12 months)
  19. Total_Ct_Chng_Q4_Q1: Ratio of the total transaction count in 4th quarter and the total transaction count in 1st quarter
  20. Total_Amt_Chng_Q4_Q1: Ratio of the total transaction amount in 4th quarter and the total transaction amount in 1st quarter
  21. Avg_Utilization_Ratio: Represents how much of the available credit the customer spent

Import required libraries

Define all required functions

Load the dataset
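As a minimal sketch of the loading step (the inline CSV snippet and its values are hypothetical stand-ins; the notebook reads the full dataset file with `pd.read_csv`):

```python
import io

import pandas as pd

# Illustrative stand-in for the real CSV file; values are hypothetical.
csv_data = io.StringIO(
    "CLIENTNUM,Attrition_Flag,Customer_Age\n"
    "700000001,Existing Customer,45\n"
    "700000002,Attrited Customer,49\n"
)
df = pd.read_csv(csv_data)
print(df.shape)  # (2, 3)
```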

Understand the data

Shape of data

Dataset information

Observations:

  1. There are 21 columns in the dataset.
  2. The Education_Level and Marital_Status columns have fewer than 10127 values, indicating the existence of NULL values.
  3. All categorical columns are of object type and can be used for one-hot encoding.

Sample data

Observations:

  1. CLIENTNUM, being a customer identifier, will not add value to model building and can be dropped during the dependent/independent variable split.
  2. Attrition_Flag is our target column; we will have to encode its values before model building.
  3. The Income_Category column needs to be reviewed, and feature engineering can be done on it if required.

Check the null values

Observations:

  1. Education_Level has 15% (1519) missing values.
  2. Marital_Status has 7.4% (749) missing values.
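A null check along these lines reports both counts and percentages (a tiny illustrative frame stands in for the full dataset):

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the notebook runs this on the full dataset.
df = pd.DataFrame({
    "Education_Level": ["Graduate", np.nan, "High School", np.nan],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
})

null_counts = df.isnull().sum()
null_percent = (null_counts / len(df) * 100).round(1)
print(pd.concat([null_counts, null_percent], axis=1, keys=["count", "percent"]))
```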

Check the duplicated values

Observations:

There are no duplicated values.

Check the basic statistics

Observations:

  1. The features are on different scales; for instance, Months_on_book has a minimum of 13 while Credit_Limit has a minimum of 1438.
  2. CLIENTNUM has uniformly high values across all statistics, as it is a customer identifier.
  3. For most variables the mean is greater than the median, depicting right-skewed data.

Check the distinct values for anomalies

Observations:

  1. Income_Category has 1112 records with the value "abc"; we will have to correct this.
  2. The Gender, Education_Level, Marital_Status and Card_Category columns don't show any anomalies.

Data pre-processing

Impute null values for Marital_Status

Education_Level, Income_Category and Card_Category can be checked for influence on Marital_Status

Observations:

  1. We have cross-checked the missing values for Marital_Status against other variables like Education_Level, Card_Category and Income_Category.
  2. Though the check suggests that most of these customers may be married, it is not clearly indicated.
  3. We will update the missing values using SimpleImputer with the most_frequent strategy.
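A sketch of the most_frequent imputation with scikit-learn's SimpleImputer (illustrative data; the notebook applies it to the full column):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame with one missing Marital_Status value.
df = pd.DataFrame({"Marital_Status": ["Married", "Single", np.nan, "Married"]})

imputer = SimpleImputer(strategy="most_frequent")
df["Marital_Status"] = imputer.fit_transform(df[["Marital_Status"]]).ravel()
print(df["Marital_Status"].isna().sum())  # 0
```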

Observations:

NULL Values are treated.

Impute null values for Education_Level

Income_Category can be checked for influence on Education_Level

Observations:

  1. We see that almost 30% of customers with missing Education_Level are from the Income_Category "Less than $40K", and most customers in that Income_Category are Graduates.
  2. We could assume and update Education_Level to "Graduate"; however, after that update the data may become biased towards one specific category.
  3. We will use SimpleImputer to impute the null values with the most_frequent strategy.

Observations:

NULL Values are treated.

Check the null values

Observations:

We see that all null values are now treated.

Process the Income_Category Column

Check the distinct data sets

We understand that Education_Level is a strong influencer of Income_Category; let us cross-check.

Observations:

  1. This is clearly a data input error.
  2. We are not able to ascertain which category should replace "abc".
  3. Since all income categories are otherwise covered in the variable, it is clear the NULL values were captured as "abc" during data entry.

Verify the update

Observations:

Income_Category "abc" is now treated and imputed with most_frequent value "Less than $40k"

Encode the target column

Observations:

The target column is now encoded with Existing Customers as 1 and Attrited Customers as 0.
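The encoding can be sketched with a simple map (a hypothetical two-row frame for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Attrition_Flag": ["Existing Customer", "Attrited Customer"]})
# Existing Customer -> 1, Attrited Customer -> 0.
df["Attrition_Flag"] = df["Attrition_Flag"].map({"Existing Customer": 1, "Attrited Customer": 0})
print(df["Attrition_Flag"].tolist())  # [1, 0]
```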

Univariate Analysis

CLIENTNUM

Observations:

  1. CLIENTNUM shows an uneven distribution of data and appears to have outliers.
  2. CLIENTNUM is the customer identifier in the dataset and will not be used in our analysis.
  3. No handling is required for this variable.

Attrition_Flag - target variable

Observations:

Target variable shows 83.9% customers are existing and 16.1% are attrited.

Customer_Age

Observations:

  1. Mean is slightly smaller than the median.
  2. The data looks normally distributed.
  3. There is one customer aged 73 years; this is a plausible case, not an outlier.

Gender

Observations:

The number of female customers is higher than the number of male customers.

Dependent_count

Observations:

Most customers have 3 dependents followed by customers with 2 and 1 dependents.

Education_Level

Observations:

Most customers are Graduate followed by High School.

Marital_Status

Observations:

  1. Around 53% of customers are married.
  2. The next highest population is single customers.

Income_Category

Observations:

  1. Most customers earn less than $40K.
  2. They are followed by customers earning between $40K and $60K.

Card_Category

Observations:

  1. Most customers in the dataset have "Blue" Card.
  2. There are very few customers using "Gold" and "Platinum" Cards.
  3. We also see some customers using "Silver" Card.

Months_on_book

Observations:

  1. Mean is almost equal to the median and the data looks well distributed on both sides.
  2. There are customers who have been associated with the bank for almost 5 years.

Total_Relationship_Count

Observations:

  1. Around a quarter of customers have 3 relationships with the bank.
  2. Customers with 4, 5 and 6 relationships are almost equal in number.

Months_Inactive_12_mon

Observations:

  1. 38% of customers have not used the card for 3 months in the last 12 months.
  2. The data shows that only a small fraction of customers are active every month.
  3. Around 1.2% of customers have not used their cards for 6 months in the last 12 months.

Contacts_Count_12_mon

Observations:

  1. 33.4% of customers interacted 3 times with bank staff in the last 12 months, followed by 31.9% of customers who were contacted 2 times.
  2. A very small percentage of customers interacted 6 times in the last 12 months.

Credit_Limit

Observations:

  1. Mean is greater than the median, depicting right-skewed data.
  2. From our domain knowledge we know that some customers will have higher credit limit compared to others.
  3. The data looks normal for the domain and no treatment required.

Total_Revolving_Bal

Observations:

  1. Mean is smaller than the median, showing slight left skewness.
  2. From domain knowledge we know that customers may or may not use the revolving credit/balance feature on their cards.
  3. The data looks appropriate for the domain and no treatment is required.

Avg_Open_To_Buy

Observations:

  1. Mean is greater than the median, depicting right-skewed data.
  2. From our domain knowledge we know that customers may or may not utilise the entire limit available on the card.
  3. The data looks normal for the domain and no treatment is required.

Total_Amt_Chng_Q4_Q1

Observations:

  1. Mean is slightly greater than the median.
  2. The variable represents a comparison of Q4 utilisation against Q1 utilisation.
  3. The data looks normal for the domain and no treatment is required.

Total_Trans_Amt

Observations:

  1. Mean is slightly greater than the median.
  2. Some customers are likely to spend more on their credit card, which makes the data legitimate and usable as is.
  3. The data looks normal for the domain and no treatment is required.

Total_Trans_Ct

Observations:

  1. Mean is slightly smaller than the median.
  2. The data largely looks balanced.
  3. A few customers have used their cards more than the others.
  4. The data looks good and no treatment is required.

Total_Ct_Chng_Q4_Q1

Observations:

  1. Mean is almost equal to the median.
  2. The variable represents a comparison of Q4 utilisation against Q1 utilisation.
  3. The data looks normal for the domain and no treatment is required.

Avg_Utilization_Ratio

Observations:

  1. Mean is greater than the median; the data looks right-skewed.
  2. Some customers are likely to spend more on their credit card, which makes the data legitimate and usable as is.
  3. The data looks normal for the domain and no treatment is required.

Bivariate Analysis

Comparison of all numerical variables with Attrition_Flag

Observations:

  1. Existing customers have slightly larger Credit_Limits.
  2. Existing customers have larger Revolving_Balances.
  3. Transaction amounts of existing customers are higher than those of attrited customers, which indicates some customers may have left because they use other credit cards, or they rarely use credit cards for purchases.
  4. Average utilisation is higher among existing customers.
  5. Inactivity during the last 12 months plays a very small role.
  6. Customer_Age does not appear to impact attrition either.

Gender VS Attrition_Flag

Observations:

The number of male customers closing the credit card account is slightly lower than that of female customers.

Education_Level VS Attrition_Flag

Observations:

  1. People with a Doctorate degree tend to leave the credit card program more than other Education_Levels.
  2. Education_Levels other than Doctorate don't show a significant impact on attrition.
  3. Data entry errors / missing data should be reduced in future data capture, to avoid NULL values forming a separate category.

Marital_Status VS Attrition_Flag

Observations:

  1. Married customers are less likely to close the credit card compared to other statuses.
  2. It is recommended that the business captures more accurate information for this variable in future to avoid NULL values.

Income_Category VS Attrition_Flag

Observations:

  1. Customers with income greater than $120K or less than $40K are more likely to close the credit card.
  2. Customers within the income range of $60K to $80K are less likely to leave.
  3. Data entry errors should be avoided for more meaningful insights.

Card_Category VS Attrition_Flag

Observations:

Platinum card members are most likely to close the credit card compared to other categories.

Multivariate Analysis

Observations:

  1. Avg_Utilization_Ratio shows significant correlation with Total_Revolving_Bal.
  2. Avg_Open_To_Buy is highly correlated with Credit_Limit.
  3. Customer_Age is highly correlated to Months_on_book.
  4. Avg_Utilization_Ratio is negatively correlated with Credit_Limit and Avg_Open_To_Buy.
  5. Total_Trans_Amt is highly correlated to Total_Trans_Ct as they are value and volume representation of Transactional Data.

Observations:

As we saw during bivariate analysis, there are no clear single indicators of attrition, as all variables show similar distributions for both existing and attrited customers.

Outliers Treatment

Draw boxplot for outliers in numerical columns

Validate the outliers using IQR
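A minimal IQR check might look like this (`count_outliers_iqr` is a helper written for illustration, not the notebook's actual function):

```python
import pandas as pd

def count_outliers_iqr(series):
    """Count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())

# Tiny example: 100 sits far outside the whiskers.
s = pd.Series([10, 12, 11, 13, 12, 100])
print(count_outliers_iqr(s))  # 1
```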

Observations:

  1. Credit_Limit, Contacts_Count_12_mon, Total_Trans_Ct and Avg_Open_To_Buy show heavy outliers.
  2. Months_on_book, Total_Amt_Chng_Q4_Q1, Total_Ct_Chng_Q4_Q1 and Months_Inactive_12_mon show a few outliers.
  3. As we saw during univariate analysis, these variables do show outliers; however, these are valid business data and we should keep them for model building.

Prepare Data for Model Building

Split data into Dependent and Independent Variables

Apply onehot encoding
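One-hot encoding with pandas `get_dummies` (the columns shown are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "M"], "Credit_Limit": [3000, 5000, 4000]})

# drop_first removes the redundant (perfectly collinear) first dummy column.
df_encoded = pd.get_dummies(df, columns=["Gender"], drop_first=True)
print(df_encoded.columns.tolist())  # ['Credit_Limit', 'Gender_M']
```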

Split data into Training, Validation and Test Sets

Training Set to have 60% data and Validation and Testing sets to have 20% data each
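The 60/20/20 split can be sketched as two stratified `train_test_split` calls (synthetic data stands in for the real features and target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# First carve off 40% as a temporary set, then split that in half:
# 60% train, 20% validation, 20% test.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=1, stratify=y_temp)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```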

Create oversampled training set

Create undersampled training set
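The notebook may use imbalanced-learn (e.g. SMOTE and RandomUnderSampler); the same idea can be sketched with plain scikit-learn resampling, shown here on synthetic data with roughly the 84/16 class balance seen above:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 84 + [1] * 16)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Oversample the minority class (with replacement) up to the majority count.
over_idx = resample(minority_idx, replace=True, n_samples=len(majority_idx), random_state=1)
X_over = np.vstack([X[majority_idx], X[over_idx]])
y_over = np.concatenate([y[majority_idx], y[over_idx]])

# Undersample the majority class (without replacement) down to the minority count.
under_idx = resample(majority_idx, replace=False, n_samples=len(minority_idx), random_state=1)
X_under = np.vstack([X[under_idx], X[minority_idx]])
y_under = np.concatenate([y[under_idx], y[minority_idx]])
```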

Model Building

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting a customer will close the credit card account when the customer doesn't - loss of resources.
  2. Predicting a customer will not close their credit card account when the customer does - loss of revenue / business.

Which case is more important?

How to reduce this loss, i.e., how do we reduce False Negatives?

Identify the models to be built

Build different models on Training Set using KFold and cross_val_score.
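A sketch of this cross-validation loop, shown on synthetic data with two of the candidate models (the notebook runs the full model list):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=1),
}
# Recall is the metric of interest, since we want to reduce False Negatives.
scores = {name: cross_val_score(model, X, y, cv=kfold, scoring="recall")
          for name, model in models.items()}
```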

Boxplot of scores

Observations:

  1. We see that the Logistic Regression, Random Forest, XGBoost and Decision Tree models are overfitting on the training set.
  2. However, cross-validation has reduced the overfitting to some extent.
  3. Random Forest, Gradient Boosting and XGBoost give the best performance, and we would want to tune them for further improvements.
  4. The boxplot shows that Gradient Boosting has 1 outlier and Random Forest has 2.
  5. Let us also build these models on the oversampled and undersampled datasets to validate our understanding.

Build different models on Oversampled Training Set using KFold and cross_val_score.

Boxplot of scores

Observations:

  1. We see that the Logistic Regression, Random Forest, XGBoost and Decision Tree models are overfitting on the oversampled training set as well.
  2. However, cross-validation has reduced the overfitting.
  3. Random Forest, Gradient Boosting and XGBoost give the best performance, and we would want to tune them for further improvements.
  4. Let us also build these models on the undersampled dataset to validate our understanding.

Build different models on Undersampled Training Set using KFold and cross_val_score.

Boxplot of scores

Observations:

  1. We see that the Random Forest, XGBoost and Decision Tree models are overfitting on the undersampled training set as well.
  2. However, cross-validation has reduced the overfitting.
  3. Random Forest, Gradient Boosting and XGBoost give the best performance on the undersampled training set.
  4. The boxplot shows that Gradient Boosting has 2 outliers and Random Forest has 1 outlier.

Compare model performance

Observations:

  1. Logistic Regression gives the best score on the Training and Oversampled Training sets.
  2. However, Random Forest, Gradient Boosting and XGBoost give consistent scores on the Training, Oversampled and Undersampled data.
  3. We have also seen that these three models give generalised performance on the validation set.
  4. We can continue with hyperparameter tuning for these models and see if the performance improves / generalises.
  5. We will build the tuned models on the Training set, as all models give the best recall on it.

Hyperparameter Tuning

RandomForestClassifier

GridSearchCV

Define Parameters

Run Grid Search and fit the model
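A sketch of the grid search (synthetic data; the parameter grid is illustrative, not the notebook's actual grid):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

param_grid = {"n_estimators": [50, 100], "max_depth": [4, 8]}  # illustrative grid
grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, scoring="recall", cv=3)
grid.fit(X, y)
best_model = grid.best_estimator_  # already refit on the full training data
```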

Use the best parameters and train a model

Evaluate the model performance

Draw Confusion Matrix on Validation Set

Observations:

  1. We see that tuning the RandomForestClassifier with the best parameters identified by GridSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

RandomizedSearchCV

Run Randomized Search and fit the model
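The randomized search can be sketched as follows (synthetic data; the parameter distributions are illustrative assumptions, not the notebook's actual ranges):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Sample candidate values from distributions instead of an exhaustive grid.
param_dist = {"n_estimators": randint(50, 200), "max_depth": randint(3, 10)}
search = RandomizedSearchCV(RandomForestClassifier(random_state=1), param_dist,
                            n_iter=5, scoring="recall", cv=3, random_state=1)
search.fit(X, y)
```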

Use the best parameters and train a model

Evaluate the model performance

Draw Confusion matrix for Validation Set

Observations:

  1. We see that tuning the RandomForestClassifier with the best parameters identified by RandomizedSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

XGBoostClassifier

Define parameters

GridSearchCV

Run Grid Search and fit the model

Use the best params and train a model

Evaluate the model performance

Draw Confusion Matrix for Validation Set

Observations:

  1. We see that tuning the XGBoostClassifier with the best parameters identified by GridSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

RandomizedSearchCV

Run Randomized Search and fit the model

Use the best parameters and train a model

Evaluate the model performance

Draw Confusion Matrix for Validation Set

Observations:

  1. We see that tuning the XGBoostClassifier with the best parameters identified by RandomizedSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

GradientBoostingClassifier

Define the parameters

Run Grid Search and fit the model

Use the best parameters and train the model

Evaluate the model performance

Draw Confusion Matrix on Validation Set

Observations:

  1. We see that tuning the GradientBoostingClassifier with the best parameters identified by GridSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

RandomizedSearchCV

Run Randomized Search and fit the model

Use the best parameters and train the model

Evaluate the model performance

Draw the Confusion Matrix on Validation Set

Observations:

  1. We see that tuning the GradientBoostingClassifier with the best parameters identified by RandomizedSearchCV has improved the recall to 1 on the training set.
  2. Performance is consistent on the Validation and Test sets as well.
  3. Accuracy and Precision scores are also consistent.

Compare the Model Performance

Observations:

  1. From the above comparison we see that all the tuned models give the same recall.
  2. Accuracy and Precision are also consistent across all datasets.
  3. All 6 models give consistent performance.
  4. The XGBoostClassifier was tuned over the largest parameter space using RandomizedSearchCV, so we expect this model to fit and perform better on future datasets.
  5. Let us also check the important features before proceeding.

Feature importance for RandomForest tuned with RandomizedSearchCV

Observations:

Total_Trans_Ct is the most important variable, followed by Total_Ct_Chng_Q4_Q1 and Total_Trans_Amt.
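Extracting and ranking importances can be sketched as follows (synthetic data and generic feature names for illustration; the notebook uses the tuned model and the real column names):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=1)
feature_names = [f"feature_{i}" for i in range(5)]  # placeholder names

model = RandomForestClassifier(random_state=1).fit(X, y)
# feature_importances_ sums to 1; sort descending to rank the features.
importances = pd.Series(model.feature_importances_,
                        index=feature_names).sort_values(ascending=False)
print(importances)
```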

Pipelines for productionizing the model

Now that we have a final model, let's use pipelines to put the model into production.

Column Transformer

  1. We know that we can use pipelines to standardize model building, but the steps in a pipeline are applied to each and every variable - how can we personalize the pipeline to perform different processing on different columns?
  2. A column transformer allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer are concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

Next steps:

  1. We will create 2 different pipelines, one for numerical columns and one for categorical columns.
  2. For numerical columns, we will do missing value imputation as pre-processing.
  3. For categorical columns, we will do missing value imputation and one-hot encoding as pre-processing.
  4. We are doing missing value imputation for the whole data so that any missing values in future data can be taken care of.
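The steps above can be sketched as follows (the column names, tiny frame, and final estimator are illustrative; the notebook wires in the tuned model and the full column lists):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Per-type pre-processing pipelines.
numeric_pipeline = Pipeline([("imputer", SimpleImputer(strategy="median"))])
categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Route each column subset through its own pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, ["Credit_Limit"]),
    ("cat", categorical_pipeline, ["Gender"]),
])

model = Pipeline([("preprocess", preprocessor),
                  ("classifier", RandomForestClassifier(random_state=1))])

# Tiny frame with missing values in both column types.
X = pd.DataFrame({"Credit_Limit": [3000.0, np.nan, 5000.0, 4000.0],
                  "Gender": ["M", "F", np.nan, "M"]})
y = [0, 1, 0, 1]
model.fit(X, y)
```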

Create pipelines for column imputation

Split independent and dependent columns

Split the data in training and test sets

We already have the model to be tuned, so we will need only Training and Testing Sets.

Create the new pipeline with best parameters

Check if the model runs on Test Set

Conclusion

  1. There was an imbalance in the data, as it contains only 16% attrited customers. This should be addressed for future analysis.
  2. Data capture issues resulted in missing values in Marital_Status and Education_Level, which were imputed with the most frequent values in the dataset. This adds to the imbalance in the data.
  3. The "abc" category in Income_Category looks like a placeholder for NULL values and was also imputed with the most frequent value, further enlarging the segment of customers earning below $40K. This should be avoided.
  4. The bank can run the model on new data to achieve the desired performance levels and to offer better services to customers.
  5. More data points should be added to the dataset for better analysis in future.
  6. We tried building models for the below combinations of variables; all of them gave similar performance on almost all models.
    • By dropping CLIENTNUM and the ratio variables.
    • By dropping CLIENTNUM only.
    • With all independent variables (the final dataset that was used).

Recommendations

  1. The business can focus on customers using revolving credit and curate specific offers to retain them.
  2. The business can also promote offers for female customers to reduce attrition.
  3. A customer satisfaction survey should be conducted for Platinum cards, as they contribute the most attrited customers.
  4. The company can run various campaigns and offers for customers with families to increase sales.
  5. Card categories other than Blue should be promoted more.
  6. Specific promotional and rewards programs can be created for different income groups.
  7. Customers with income of more than $120K should also be focused on, to enhance their satisfaction and reduce attrition.
  8. Customers with Doctorate degrees contribute more to the attrited customers. The business can use customer satisfaction survey data, implement the findings, and improve the retention ratio of such customers.
  9. The data collection process can be enhanced to capture additional information related to customer employment, living status and spending patterns. This will help in doing better missing value treatment and will also provide uniform data for analysis.
  10. The business should also focus on customers that have not used their cards for more than 3 months in the last year. There is a chance that these customers may close their card accounts.